Downloading the web traffic time series data from kaggle & unzipping it.¶
https://www.kaggle.com/c/web-traffic-time-series-forecasting¶
In [1]:
import kaggle
! kaggle competitions download -c web-traffic-time-series-forecasting -p D:\kaggle\data\web-traffic
In [2]:
! unzip 'D:\kaggle\data\web-traffic\web-traffic-time-series-forecasting.zip' -d 'D:\kaggle\data\web-traffic'
Archive: D:\kaggle\data\web-traffic\web-traffic-time-series-forecasting.zip inflating: D:\kaggle\data\web-traffic/key_1.csv.zip inflating: D:\kaggle\data\web-traffic/key_2.csv.zip inflating: D:\kaggle\data\web-traffic/sample_submission_1.csv.zip inflating: D:\kaggle\data\web-traffic/sample_submission_2.csv.zip inflating: D:\kaggle\data\web-traffic/train_1.csv.zip inflating: D:\kaggle\data\web-traffic/train_2.csv.zip
Import the standard libraries¶
In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [4]:
data = pd.read_csv(r"D:\kaggle\data\web-traffic\train_1.csv.zip")
data.head()
Out[4]:
| Page | 2015-07-01 | 2015-07-02 | 2015-07-03 | 2015-07-04 | 2015-07-05 | 2015-07-06 | 2015-07-07 | 2015-07-08 | 2015-07-09 | ... | 2016-12-22 | 2016-12-23 | 2016-12-24 | 2016-12-25 | 2016-12-26 | 2016-12-27 | 2016-12-28 | 2016-12-29 | 2016-12-30 | 2016-12-31 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2NE1_zh.wikipedia.org_all-access_spider | 18.0 | 11.0 | 5.0 | 13.0 | 14.0 | 9.0 | 9.0 | 22.0 | 26.0 | ... | 32.0 | 63.0 | 15.0 | 26.0 | 14.0 | 20.0 | 22.0 | 19.0 | 18.0 | 20.0 |
| 1 | 2PM_zh.wikipedia.org_all-access_spider | 11.0 | 14.0 | 15.0 | 18.0 | 11.0 | 13.0 | 22.0 | 11.0 | 10.0 | ... | 17.0 | 42.0 | 28.0 | 15.0 | 9.0 | 30.0 | 52.0 | 45.0 | 26.0 | 20.0 |
| 2 | 3C_zh.wikipedia.org_all-access_spider | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 4.0 | 0.0 | 3.0 | 4.0 | ... | 3.0 | 1.0 | 1.0 | 7.0 | 4.0 | 4.0 | 6.0 | 3.0 | 4.0 | 17.0 |
| 3 | 4minute_zh.wikipedia.org_all-access_spider | 35.0 | 13.0 | 10.0 | 94.0 | 4.0 | 26.0 | 14.0 | 9.0 | 11.0 | ... | 32.0 | 10.0 | 26.0 | 27.0 | 16.0 | 11.0 | 17.0 | 19.0 | 10.0 | 11.0 |
| 4 | 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 48.0 | 9.0 | 25.0 | 13.0 | 3.0 | 11.0 | 27.0 | 13.0 | 36.0 | 10.0 |
5 rows × 551 columns
In [5]:
data.shape
Out[5]:
(145063, 551)
Transforming the above data into 'Page', 'date' & 'visits' columns¶
In [6]:
df = pd.melt(data,
id_vars='Page',
var_name='date',
value_name='visits')
df.head()
Out[6]:
| Page | date | visits | |
|---|---|---|---|
| 0 | 2NE1_zh.wikipedia.org_all-access_spider | 2015-07-01 | 18.0 |
| 1 | 2PM_zh.wikipedia.org_all-access_spider | 2015-07-01 | 11.0 |
| 2 | 3C_zh.wikipedia.org_all-access_spider | 2015-07-01 | 1.0 |
| 3 | 4minute_zh.wikipedia.org_all-access_spider | 2015-07-01 | 35.0 |
| 4 | 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... | 2015-07-01 | NaN |
In [7]:
df.isnull().sum()
Out[7]:
Page 0 date 0 visits 6192931 dtype: int64
In [8]:
print("Start Date : ", df['date'].min())
print("End Date : ", df['date'].max())
Start Date : 2015-07-01 End Date : 2016-12-31
GroupBy 'date' & sum up the 'visits'¶
In [9]:
df = df.groupby(['date']).agg({'visits':'sum'}).reset_index()
df['date'] = pd.to_datetime(df['date'])
df = df[df['date'] >= '2016-01-01'].sort_values(by='date')
df.head()
Out[9]:
| date | visits | |
|---|---|---|
| 184 | 2016-01-01 | 182017309.0 |
| 185 | 2016-01-02 | 194359465.0 |
| 186 | 2016-01-03 | 200097562.0 |
| 187 | 2016-01-04 | 198959845.0 |
| 188 | 2016-01-05 | 187263054.0 |
In [10]:
df.isnull().sum()
Out[10]:
date 0 visits 0 dtype: int64
In [11]:
print("Start Date : ", df['date'].min())
print("End Date : ", df['date'].max())
Start Date : 2016-01-01 00:00:00 End Date : 2016-12-31 00:00:00
In [ ]:
In [ ]:
Visualize the web traffic time series data¶
In [12]:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['date'], y=df['visits'], mode='lines', name='Visits'))
fig.update_layout(title='Website Traffic over Timespan', xaxis_title='Date', yaxis_title='Visits')
fig.show()
Check for Trend¶
In [13]:
def rolling_mean_std(x, y, window_size):
rolmean = y.rolling(window=window_size, min_periods=1).mean()
rolstd = y.rolling(window=window_size, min_periods=1).std()
fig = go.Figure()
# Add traces for original, rolling mean, and rolling std
fig.add_trace(go.Scatter(x=x, y=y, mode='lines', name='Original', line=dict(color='red')))
fig.add_trace(go.Scatter(x=x, y=rolmean, mode='lines', name='Rolling Mean', line=dict(color='black')))
fig.add_trace(go.Scatter(x=x, y=rolstd, mode='lines', name='Rolling Variance', line=dict(color='green')))
# Update layout
fig.update_layout(
xaxis=dict(title='Date', tickformat="%b %Y"),
yaxis=dict(title='Mean & Variance'),
title='Rolling Mean & Standard Deviation of Website Traffic',
legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='right', x=1)
)
fig.show()
In [14]:
rolling_mean_std(x=df['date'], y=df['visits'], window_size=10)
Check for Seasonality¶
In [15]:
from plotly.subplots import make_subplots
from statsmodels.tsa.seasonal import DecomposeResult, seasonal_decompose
def plot_seasonal_decompose(result:DecomposeResult, dates:pd.Series=None, title:str=""):
x_values = dates if dates is not None else np.arange(len(result.observed))
return (make_subplots(rows=4, cols=1, subplot_titles=["Observed", "Trend", "Seasonal", "Residuals"])
.add_trace(go.Scatter(x=x_values, y=result.observed, mode="lines", name='Observed'),row=1, col=1)
.add_trace(go.Scatter(x=x_values, y=result.trend, mode="lines", name='Trend'),row=2,col=1)
.add_trace(go.Scatter(x=x_values, y=result.seasonal, mode="lines", name='Seasonal'),row=3,col=1)
.add_trace(go.Scatter(x=x_values, y=result.resid, mode="lines", name='Residual'),row=4,col=1)
.update_layout(height=900, title=f'<b>{title}</b>', margin={'t':100}, title_x=0.5, showlegend=False))
Additive Decomposition¶
Additive Time Series : Value = Base Level + Trend + Seasonality + Error
In [16]:
additive_decomposition = seasonal_decompose(df['visits'], model='additive', period=30)
plot_seasonal_decompose(additive_decomposition,
title="Seasonal Additive Decomposition on Original data",
dates = df['date']).show()
Augmented Dickey-Fuller (ADF) Test:¶
In [17]:
from statsmodels.tsa.stattools import adfuller
def dickyFullerTest(x):
result = adfuller(x)
print("ADF Statistics: %f" %result[0])
print("p-value: %f" %result[1])
print('Critical Value')
for key, value in result[4].items():
print('\t%s: %.3f'%(key, value))
if result[1] > 0.05:
print("Fail to reject null hypothesis(H0), the data is non-stationary.")
else:
print("Reject the null hypothesis(H0), the data is stationary.")
In [18]:
dickyFullerTest(df['visits'])
ADF Statistics: -2.640856 p-value: 0.084844 Critical Value 1%: -3.449 5%: -2.870 10%: -2.571 Fail to reject null hypothesis(H0), the data is non-stationary.
In [ ]:
In [ ]:
The data exhibits non-stationary behavior, and efforts are being made to transform it into a stationary state.¶
Differencing Method -- First Order Differencing (d = 1)¶
In [19]:
df['visits_diff_1'] = df['visits'].diff()
df.dropna(inplace=True)
df.head()
Out[19]:
| date | visits | visits_diff_1 | |
|---|---|---|---|
| 185 | 2016-01-02 | 194359465.0 | 12342156.0 |
| 186 | 2016-01-03 | 200097562.0 | 5738097.0 |
| 187 | 2016-01-04 | 198959845.0 | -1137717.0 |
| 188 | 2016-01-05 | 187263054.0 | -11696791.0 |
| 189 | 2016-01-06 | 188006324.0 | 743270.0 |
Visualize the fist order differencing (d = 1) data¶
In [20]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['date'], y=df['visits_diff_1'], mode='lines', line=dict(color='blue')))
fig.update_layout(
title='First Order Differencing (d=1)',
xaxis=dict(title='Date', tickformat="%b %Y"),
yaxis=dict(title='Differencing Mean'),
)
fig.show()
Check for Trend for fist order differencing (d = 1) data¶
In [21]:
rolling_mean_std(x=df['date'], y=df['visits_diff_1'], window_size=10)
Check Seasonality for fist order differencing (d = 1) data¶
In [22]:
additive_decomposition = seasonal_decompose(df['visits_diff_1'], model='additive', period=30)
plot_seasonal_decompose(additive_decomposition,
title="Seasonal Additive Decomposition on First Order Differencing (d=1) data",
dates = df['date']).show()
The seasonal frequency (m) identified is 12, indicating a monthly-wise series.¶
Augmented Dickey-Fuller (ADF) Test¶
In [23]:
dickyFullerTest(df['visits_diff_1'])
ADF Statistics: -6.337131 p-value: 0.000000 Critical Value 1%: -3.449 5%: -2.870 10%: -2.571 Reject the null hypothesis(H0), the data is stationary.
In [ ]:
Web Traffic time series data is now turned to be Stationary.¶
In [ ]:
Autocorrelation Analysis¶
In [24]:
from statsmodels.tsa.stattools import acf, pacf
def create_corr_plot(series, plot_pacf=False):
corr_array = pacf(series.dropna(), alpha=0.05) if plot_pacf else acf(series.dropna(), alpha=0.05)
lags = np.arange(len(corr_array[0]))
autocorr_values = corr_array[0]
conf_interv = corr_array[1]
lower_y = conf_interv[:, 0] - autocorr_values
upper_y = conf_interv[:, 1] - autocorr_values
fig = go.Figure()
fig.add_scatter(x=lags, y=autocorr_values, mode='markers', marker_color='#1f77b4', marker_size=12)
for lag, autocorr_val in zip(lags, autocorr_values):
fig.add_shape(type='line',
x0=lag, y0=0,
x1=lag, y1=autocorr_val,
line=dict(color='#3f3f3f'))
# Plot the confidence intervals
fig.add_scatter(x=lags, y=upper_y, mode='lines', line_color='rgba(255,255,255,0)')
fig.add_scatter(x=lags, y=lower_y, mode='lines', fillcolor='rgba(32, 146, 230,0.3)',
fill='tonexty', line_color='rgba(255,255,255,0)')
# Update layout
fig.update_traces(showlegend=False)
fig.update_xaxes(range=[-1, len(lags)])
fig.update_yaxes(zerolinecolor='#000000')
# Set the title
title = 'Partial Autocorrelation (PACF)' if plot_pacf else 'Autocorrelation (ACF)'
fig.update_layout(title=title)
fig.show()
In [25]:
create_corr_plot(df['visits_diff_1'], plot_pacf=False)
q = 2, turned out to be most significant lag from above ACF plot.¶
In [26]:
create_corr_plot(df['visits_diff_1'], plot_pacf=True)
p = 2, turned out to be most significant lag from above PACF plot.¶
In [ ]:
In [ ]:
Time Series Forecasting - ARIMA¶
In [27]:
import itertools
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")
In [28]:
y = df.set_index("date")['visits'].copy()
y.index
Out[28]:
DatetimeIndex(['2016-01-02', '2016-01-03', '2016-01-04', '2016-01-05',
'2016-01-06', '2016-01-07', '2016-01-08', '2016-01-09',
'2016-01-10', '2016-01-11',
...
'2016-12-22', '2016-12-23', '2016-12-24', '2016-12-25',
'2016-12-26', '2016-12-27', '2016-12-28', '2016-12-29',
'2016-12-30', '2016-12-31'],
dtype='datetime64[ns]', name='date', length=365, freq=None)
Fitting the ARIMA(p,d,q) model with order ARIMA(2, 1, 2) from above analysis.¶
In [29]:
arima_model = sm.tsa.ARIMA(y, order = (2, 1, 2)) # ARIMA(p,d,q)
results = arima_model.fit()
print(results.summary().tables[1])
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 1.2487 0.004 316.554 0.000 1.241 1.256
ar.L2 -0.9986 0.003 -340.056 0.000 -1.004 -0.993
ma.L1 -1.2394 0.020 -61.146 0.000 -1.279 -1.200
ma.L2 0.9974 0.033 30.280 0.000 0.933 1.062
sigma2 1.807e+14 1.99e-16 9.1e+29 0.000 1.81e+14 1.81e+14
==============================================================================
In [30]:
pred = results.get_prediction(start=pd.to_datetime('2016-01-02'), dynamic=False)
pred_ci = pred.conf_int()
y_forecasted = pred.predicted_mean
y_truth = y['2016-01-02':]
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
The Mean Squared Error of our forecasts is 261592157080111.44
In [31]:
print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))
The Root Mean Squared Error of our forecasts is 16173810.84
In [ ]:
In [ ]:
Also, Finding the best value for ARIMA(p,d,q)¶
In [32]:
p = range(0, 3)
d = range(1, 2)
q = range(0, 3)
pdq = list(itertools.product(p, d, q))
print('Examples of parameter combinations for ARIMA...')
for i in range(len(pdq)):
print('ARIMA: {}'.format(pdq[i]))
Examples of parameter combinations for ARIMA... ARIMA: (0, 1, 0) ARIMA: (0, 1, 1) ARIMA: (0, 1, 2) ARIMA: (1, 1, 0) ARIMA: (1, 1, 1) ARIMA: (1, 1, 2) ARIMA: (2, 1, 0) ARIMA: (2, 1, 1) ARIMA: (2, 1, 2)
In [33]:
# In ARIMA modeling, AIC stands for Akaike Information Criterion. It is a measure of the relative quality of a statistical model
# for a given set of data. AIC provides a way to compare different models with different numbers of parameters, and choose
# the one that best balances the trade-off between model fit and model complexity.
# The AIC value is calculated as: AIC = 2k - 2ln(L)
# where k is the number of parameters in the model and L is the likelihood function of the model.
# The AIC value is a scalar, and the lower the AIC value, the better the model is considered to fit the data.
# In ARIMA modeling, AIC is often used as a criterion for selecting the optimal values of the ARIMA parameters (p,d,q)
# that provide the best fit to the data. The idea is to fit a range of ARIMA models with different parameter combinations,
# and choose the one with the lowest AIC value as the best model.
best_param = None
min_aic = float('inf')
for param in pdq:
try:
model_arima = sm.tsa.ARIMA(y, order=param)
model_arima_fit = model_arima.fit()
current_aic = model_arima_fit.aic
print('ARIMA{} - AIC:{}'.format(param, current_aic))
# Update minimum AIC and corresponding parameters if the current model is better
if current_aic < min_aic:
min_aic = current_aic
best_param = param
except Exception as e:
print(f'Error for order {param}: {e}')
continue
ARIMA(0, 1, 0) - AIC:13011.268522097826 ARIMA(0, 1, 1) - AIC:13012.865388545795 ARIMA(0, 1, 2) - AIC:13005.394493766033 ARIMA(1, 1, 0) - AIC:13012.951084823228 ARIMA(1, 1, 1) - AIC:13006.142194397306 ARIMA(1, 1, 2) - AIC:13001.203281797129 ARIMA(2, 1, 0) - AIC:13006.219455704744 ARIMA(2, 1, 1) - AIC:13000.38726339147 ARIMA(2, 1, 2) - AIC:12948.089683272314
In [34]:
print('Best Model: ARIMA{} - Minimum AIC:{}'.format(best_param, min_aic))
Best Model: ARIMA(2, 1, 2) - Minimum AIC:12948.089683272314
In [35]:
arima_model = sm.tsa.ARIMA(y, order=best_param)
results = arima_model.fit()
print(results.summary().tables[1])
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 1.2487 0.004 316.554 0.000 1.241 1.256
ar.L2 -0.9986 0.003 -340.056 0.000 -1.004 -0.993
ma.L1 -1.2394 0.020 -61.146 0.000 -1.279 -1.200
ma.L2 0.9974 0.033 30.280 0.000 0.933 1.062
sigma2 1.807e+14 1.99e-16 9.1e+29 0.000 1.81e+14 1.81e+14
==============================================================================
Interpret the Plot Diagnostics¶
In [36]:
# Top left: The residual errors seem to fluctuate around a mean of zero and have a uniform variance.
# Top Right: The density plot suggest normal distribution with mean zero.
# Bottom left: All the dots should fall perfectly in line with the red line. Any significant deviations would imply the distribution is skewed.
# Bottom Right: The Correlogram, aka, ACF plot shows the residual errors are not autocorrelated. Any autocorrelation would imply that there is some pattern in the residual errors which are not explained in the model.
results.plot_diagnostics(figsize=(16, 8)).show()
Validating forecast (Interpolation)¶
In [37]:
pred = results.get_prediction(start=pd.to_datetime('2016-01-02'), dynamic=False)
pred_ci = pred.conf_int()
y_forecasted = pred.predicted_mean
y_truth = y['2016-01-02':]
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
The Mean Squared Error of our forecasts is 261592157080111.44
In [38]:
print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))
The Root Mean Squared Error of our forecasts is 16173810.84
In [39]:
pred = results.get_prediction(start=pd.to_datetime('2016-07-01'), dynamic=False)
pred_ci = pred.conf_int()
ax = y['2016-07-01':].plot(label='Observed Web Traffic')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast Web Traffic', alpha=.7, figsize=(14, 7))
ax.fill_between(pred_ci.index,
pred_ci.iloc[:, 0],
pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('Daily web traffic')
plt.legend()
plt.show()
Producing and visualizing forecasts (Extrapolation)¶
In [40]:
pred_uc = results.get_forecast(steps=40)
pred_ci = pred_uc.conf_int()
observed_trace = go.Scatter(x=y['2016-11-01':].index, y=y['2016-11-01':].values, mode='lines', name='Observed Web Traffic')
forecast_trace = go.Scatter(x=pred_uc.predicted_mean.index, y=pred_uc.predicted_mean.values, mode='lines', name='Forecast Web Traffic')
confidence_interval_trace = go.Scatter(x=pred_ci.index,
y=pred_ci.iloc[:, 0],
fill='tonexty',
line=dict(color='rgba(68, 68, 68, 0.1)'),
fillcolor='rgba(68, 68, 68, 0.1)',
name='95% Confidence Interval')
layout = go.Layout(title='Anticipating Patterns in Website Traffic Over Time',
xaxis=dict(title='Date'),
yaxis=dict(title='Daily website visits'))
fig = go.Figure(data=[observed_trace, forecast_trace, confidence_interval_trace], layout=layout)
fig.show()
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
Time Series Forecasting - SARIMA¶
In [41]:
p = range(0, 3)
d = range(1, 2)
q = range(0, 3)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
Examples of parameter combinations for Seasonal ARIMA... SARIMAX: (0, 1, 1) x (0, 1, 1, 12) SARIMAX: (0, 1, 1) x (0, 1, 2, 12) SARIMAX: (0, 1, 2) x (1, 1, 0, 12) SARIMAX: (0, 1, 2) x (1, 1, 1, 12)
Fitting the SARIMA model¶
In [42]:
best_param = None
best_seasonal_param = None
min_aic = float('inf')
for param in pdq:
for param_seasonal in seasonal_pdq:
try:
mod = sm.tsa.statespace.SARIMAX(y,
order=param,
seasonal_order=param_seasonal,
enforce_stationarity=False,
enforce_invertibility=False)
results = mod.fit()
current_aic = results.aic
print('SARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, current_aic))
# Update minimum AIC and corresponding parameters if the current model is better
if current_aic < min_aic:
min_aic = current_aic
best_param = param
best_seasonal_param = param_seasonal
except:
continue
SARIMA(0, 1, 0)x(0, 1, 0, 12)12 - AIC:12846.060052889701 SARIMA(0, 1, 0)x(0, 1, 1, 12)12 - AIC:12226.501648249074 SARIMA(0, 1, 0)x(0, 1, 2, 12)12 - AIC:11772.523690160298 SARIMA(0, 1, 0)x(1, 1, 0, 12)12 - AIC:12325.343625401916 SARIMA(0, 1, 0)x(1, 1, 1, 12)12 - AIC:12208.284242474936 SARIMA(0, 1, 0)x(1, 1, 2, 12)12 - AIC:11760.029582166388 SARIMA(0, 1, 0)x(2, 1, 0, 12)12 - AIC:11816.614662039878 SARIMA(0, 1, 0)x(2, 1, 1, 12)12 - AIC:11784.179438873303 SARIMA(0, 1, 0)x(2, 1, 2, 12)12 - AIC:11747.1957079029 SARIMA(0, 1, 1)x(0, 1, 0, 12)12 - AIC:12811.0533930879 SARIMA(0, 1, 1)x(0, 1, 1, 12)12 - AIC:12225.807546729648 SARIMA(0, 1, 1)x(0, 1, 2, 12)12 - AIC:11777.831988841932 SARIMA(0, 1, 1)x(1, 1, 0, 12)12 - AIC:12342.430154746453 SARIMA(0, 1, 1)x(1, 1, 1, 12)12 - AIC:12219.971930236874 SARIMA(0, 1, 1)x(1, 1, 2, 12)12 - AIC:11779.830808048628 SARIMA(0, 1, 1)x(2, 1, 0, 12)12 - AIC:11865.521243758252 SARIMA(0, 1, 1)x(2, 1, 1, 12)12 - AIC:11847.831661429642 SARIMA(0, 1, 1)x(2, 1, 2, 12)12 - AIC:11776.28771341421 SARIMA(0, 1, 2)x(0, 1, 0, 12)12 - AIC:12756.81580064029 SARIMA(0, 1, 2)x(0, 1, 1, 12)12 - AIC:12179.388952085586 SARIMA(0, 1, 2)x(0, 1, 2, 12)12 - AIC:11732.885049497905 SARIMA(0, 1, 2)x(1, 1, 0, 12)12 - AIC:12337.177906686102 SARIMA(0, 1, 2)x(1, 1, 1, 12)12 - AIC:12174.899520207899 SARIMA(0, 1, 2)x(1, 1, 2, 12)12 - AIC:11734.864995530104 SARIMA(0, 1, 2)x(2, 1, 0, 12)12 - AIC:11855.459504496972 SARIMA(0, 1, 2)x(2, 1, 1, 12)12 - AIC:11837.625627037462 SARIMA(0, 1, 2)x(2, 1, 2, 12)12 - AIC:11729.517667560749 SARIMA(1, 1, 0)x(0, 1, 0, 12)12 - AIC:12847.169560792358 SARIMA(1, 1, 0)x(0, 1, 1, 12)12 - AIC:12260.981536881758 SARIMA(1, 1, 0)x(0, 1, 2, 12)12 - AIC:11814.120117622264 SARIMA(1, 1, 0)x(1, 1, 0, 12)12 - AIC:12306.79652996729 SARIMA(1, 1, 0)x(1, 1, 1, 12)12 - AIC:12255.042202380906 SARIMA(1, 1, 0)x(1, 1, 2, 12)12 - AIC:11816.114671242507 SARIMA(1, 1, 0)x(2, 1, 0, 12)12 - AIC:11828.044071660855 SARIMA(1, 1, 0)x(2, 1, 1, 12)12 - AIC:11810.5755463816 SARIMA(1, 1, 0)x(2, 1, 2, 12)12 - AIC:11812.31178798001 SARIMA(1, 1, 1)x(0, 1, 0, 12)12 - AIC:12808.463382841677 SARIMA(1, 1, 1)x(0, 1, 1, 12)12 - AIC:12227.556325524929 SARIMA(1, 1, 1)x(0, 1, 2, 12)12 - AIC:11779.38276047206 SARIMA(1, 1, 1)x(1, 1, 0, 12)12 - AIC:12308.048096188937 SARIMA(1, 1, 1)x(1, 1, 1, 12)12 - AIC:12221.876817677297 SARIMA(1, 1, 1)x(1, 1, 2, 12)12 - AIC:11781.382752087411 SARIMA(1, 1, 1)x(2, 1, 0, 12)12 - AIC:11830.530310816703 SARIMA(1, 1, 1)x(2, 1, 1, 12)12 - AIC:11811.363771274686 SARIMA(1, 1, 1)x(2, 1, 2, 12)12 - AIC:11778.101714889488 SARIMA(1, 1, 2)x(0, 1, 0, 12)12 - AIC:12754.673371571062 SARIMA(1, 1, 2)x(0, 1, 1, 12)12 - AIC:12176.12802751713 SARIMA(1, 1, 2)x(0, 1, 2, 12)12 - AIC:11729.513861397629 SARIMA(1, 1, 2)x(1, 1, 0, 12)12 - AIC:12290.93377373556 SARIMA(1, 1, 2)x(1, 1, 1, 12)12 - AIC:12170.88148443775 SARIMA(1, 1, 2)x(1, 1, 2, 12)12 - AIC:11731.513035528089 SARIMA(1, 1, 2)x(2, 1, 0, 12)12 - AIC:11816.572391384905 SARIMA(1, 1, 2)x(2, 1, 1, 12)12 - AIC:11800.008770164233 SARIMA(1, 1, 2)x(2, 1, 2, 12)12 - AIC:11727.832297038127 SARIMA(2, 1, 0)x(0, 1, 0, 12)12 - AIC:12793.93366642181 SARIMA(2, 1, 0)x(0, 1, 1, 12)12 - AIC:12250.138887808927 SARIMA(2, 1, 0)x(0, 1, 2, 12)12 - AIC:11805.23150819676 SARIMA(2, 1, 0)x(1, 1, 0, 12)12 - AIC:12262.469060402706 SARIMA(2, 1, 0)x(1, 1, 1, 12)12 - AIC:12209.132653321116 SARIMA(2, 1, 0)x(1, 1, 2, 12)12 - AIC:11807.209648324753 SARIMA(2, 1, 0)x(2, 1, 0, 12)12 - AIC:11781.3193314958 SARIMA(2, 1, 0)x(2, 1, 1, 12)12 - AIC:11762.41269983144 SARIMA(2, 1, 0)x(2, 1, 2, 12)12 - AIC:11764.346630128577 SARIMA(2, 1, 1)x(0, 1, 0, 12)12 - AIC:12792.72720639366 SARIMA(2, 1, 1)x(0, 1, 1, 12)12 - AIC:12212.954500314521 SARIMA(2, 1, 1)x(0, 1, 2, 12)12 - AIC:11765.701798363096 SARIMA(2, 1, 1)x(1, 1, 0, 12)12 - AIC:12252.450447685607 SARIMA(2, 1, 1)x(1, 1, 1, 12)12 - AIC:12207.23880237034 SARIMA(2, 1, 1)x(1, 1, 2, 12)12 - AIC:11767.687929274362 SARIMA(2, 1, 1)x(2, 1, 0, 12)12 - AIC:11780.257848620455 SARIMA(2, 1, 1)x(2, 1, 1, 12)12 - AIC:11763.02579004219 SARIMA(2, 1, 1)x(2, 1, 2, 12)12 - AIC:11764.988938979253 SARIMA(2, 1, 2)x(0, 1, 0, 12)12 - AIC:12756.124963598191 SARIMA(2, 1, 2)x(0, 1, 1, 12)12 - AIC:12177.38313074468 SARIMA(2, 1, 2)x(0, 1, 2, 12)12 - AIC:11707.078951105072 SARIMA(2, 1, 2)x(1, 1, 0, 12)12 - AIC:12207.905579907023 SARIMA(2, 1, 2)x(1, 1, 1, 12)12 - AIC:12145.788338169237 SARIMA(2, 1, 2)x(1, 1, 2, 12)12 - AIC:11707.619927766373 SARIMA(2, 1, 2)x(2, 1, 0, 12)12 - AIC:11757.538013007066 SARIMA(2, 1, 2)x(2, 1, 1, 12)12 - AIC:11743.740409286165 SARIMA(2, 1, 2)x(2, 1, 2, 12)12 - AIC:11709.086570020396
In [43]:
print('Best Model: SARIMA{}x{}12 - Minimum AIC:{}'.format(best_param, best_seasonal_param, min_aic))
Best Model: SARIMA(2, 1, 2)x(0, 1, 2, 12)12 - Minimum AIC:11707.078951105072
In [44]:
sarima_model = sm.tsa.statespace.SARIMAX(y,
order=best_param,
seasonal_order=best_seasonal_param,
enforce_stationarity=False,
enforce_invertibility=False)
results = sarima_model.fit()
print(results.summary().tables[1])
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 1.2372 0.024 51.197 0.000 1.190 1.285
ar.L2 -0.9801 0.025 -39.078 0.000 -1.029 -0.931
ma.L1 -1.2456 0.035 -35.428 0.000 -1.315 -1.177
ma.L2 0.9325 0.044 21.190 0.000 0.846 1.019
ma.S.L12 -0.9212 0.105 -8.796 0.000 -1.126 -0.716
ma.S.L24 0.2445 0.099 2.467 0.014 0.050 0.439
sigma2 4.281e+14 7.86e-17 5.44e+30 0.000 4.28e+14 4.28e+14
==============================================================================
Interpret the Plot Diagnostics¶
In [45]:
results.plot_diagnostics(figsize=(16, 8)).show()
Validating forecast (Interpolation)¶
In [46]:
pred = results.get_prediction(start=pd.to_datetime('2016-01-02'), dynamic=False)
pred_ci = pred.conf_int()
y_forecasted = pred.predicted_mean
y_truth = y['2016-01-02':]
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
The Mean Squared Error of our forecasts is 460239255099706.6
In [47]:
print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))
The Root Mean Squared Error of our forecasts is 21453187.53
In [48]:
pred = results.get_prediction(start=pd.to_datetime('2016-07-01'), dynamic=False)
pred_ci = pred.conf_int()
ax = y['2016-07-01':].plot(label='Observed Web Traffic')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast Web Traffic', alpha=.7, figsize=(14, 7))
ax.fill_between(pred_ci.index,
pred_ci.iloc[:, 0],
pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('Daily web traffic')
plt.legend()
plt.show()
Producing and visualizing forecasts (Extrapolation)¶
In [49]:
pred_uc = results.get_forecast(steps=40)
pred_ci = pred_uc.conf_int()
observed_trace = go.Scatter(x=y['2016-11-01':].index, y=y['2016-11-01':].values, mode='lines', name='Observed Web Traffic')
forecast_trace = go.Scatter(x=pred_uc.predicted_mean.index, y=pred_uc.predicted_mean.values, mode='lines', name='Forecast Web Traffic')
confidence_interval_trace = go.Scatter(x=pred_ci.index,
y=pred_ci.iloc[:, 0],
fill='tonexty',
line=dict(color='rgba(68, 68, 68, 0.1)'),
fillcolor='rgba(68, 68, 68, 0.1)',
name='95% Confidence Interval')
layout = go.Layout(title='Anticipating Patterns in Website Traffic Over Time ',
xaxis=dict(title='Date'),
yaxis=dict(title='Daily web traffic'))
fig = go.Figure(data=[observed_trace, forecast_trace, confidence_interval_trace], layout=layout)
fig.show()
In [ ]:
In [ ]: